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Abstract 

Background: Transcription factors (TFs) and microRNAs (nniRNAs) are primary metazoan gene regulators. Regulatory 
mechanisnns of the two main regulators are of great interest to biologists and may provide insights into the causes of 
diseases. However, the interplay between miRNAs and TFs in a regulatory network still remains unearthed. Currently, it 
is very difficult to study the regulatory mechanisms that involve both miRNAs and TFs in a biological lab. Even at data 
level, a network involving miRNAs, TFs and genes will be too complicated to achieve. Previous research has been 
mostly directed at inferring either miRNA orTF regulatory networks from data. However, networks involving a single 
type of regulator may not fully reveal the complex gene regulatory mechanisms, for instance, the way in which a TF 
indirectly regulates a gene via a miRNA. 

Results: We propose a framework to learn from heterogeneous data the three-component regulatory networks, with 
the presence of miRNAs, TFs, and mRNAs. This method firstly utilises Bayesian network structure learning to construct 
a regulatory network from multiple sources of data: gene expression profiles of miRNAs, TFs and mRNAs, target 
information based on sequence data, and sample categories. Then, in order to produce more meaningful results for 
further biological experimentation and research, the method searches the learnt network to identify the interplay 
between miRNAs and TFs and applies a network motif finding algorithm to further infer the network. 
We apply the proposed framework to the data sets of epithelial-to-mesenchymal transition (EMTl.The results 
elucidate the complex gene regulatory mechanism for EMT which involves both TFs and miRNAs. Several discovered 
interactions and molecular functions have been confirmed by literature. In addition, many other discovered 
interactions and bio-markers are of high statistical significance and thus can be good candidates for validation by 
experiments. Moreover, the results generated by our method are compact, involving a small number of interactions 
which have been proved highly relevant to EMT. 

Conclusions: We have designed a framework to infer gene regulatory networks involving both TFs and miRNAs from 
multiple sources of data, including gene expression data, target information, and sample categories. Results on the 
EMT data sets have shown that the proposed approach is able to produce compact and meaningful gene regulatory 
networks that are highly relevant to the biological conditions of the data sets. This framework has the potential for 
application to other heterogeneous datasets to reveal the complex gene regulatory relationships. 



Background 

The regulation of gene expression is a critical mechanism 
in the control of biological processes in cellular organisms. 
At the transcriptional level, the main regulators contribut- 
ing to the control are transcription factors (TFs), proteins 
that bind to cis-regulatory elements in the gene promoter 
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regions [1]. By activating or repressing their target genes, 
TFs can regulate the global gene expression program of a 
living cell, and form transcriptional regulatory networks 
[2-4]. 

Recent studies have identified that microRNAs 
(miRNAs) play an important role in gene regulation at the 
post-transcriptional level. The regulation process takes 
place via mRNA cleavage or translational repression, with 
miRNAs binding to the 3'-untranslated regions (3'-UTRs) 
of target mRNAs through base pairing to complemen- 
tary sequences [5-8]. It has also been demonstrated in a 
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body of literature that miRNAs regulate a wide range of 
biological processes in proliferation [9,10], metabolism 
[11,12], differentiation [13], development [14,15], apop- 
tosis [12,16,17], cellular signaling [18] and even cancer 
development and progression [7,19,20]. 

It is necessary to draw a unified picture for the reg- 
ulatory relationships between TFs, miRNAs and genes. 
However, a challenge is that the combined regulations 
of miRNAs and TFs are complicated, since they involve 
not only the interactions between each regulator and 
their target genes, but also the interactions between the 
regulators themselves. Studies of the gene regulatory net- 
works with the presence of both TFs and miRNAs will 
help elucidate the regulatory mechanisms involving both 
direct and indirect regulatory relationships. However, It 
is still highly unlikely for these relationships to be dis- 
covered by biological experiments directly, as the process 
would be extremely costly and time consuming. On 
the other hand, well-designed computational approaches 
may facilitate the understanding of such complex 
relationships. 

Previously, researchers studied the co-regulation of TFs 
and miRNAs by finding out their shared downstream tar- 
gets [21,22]. The methods used probabilistic models or 
statistical tests to measure the significance of the shared 
targets between the regulators, and to remove the insignif- 
icant co-regulating interactions that occurred by chance. 
Gene enrichment analysis was used in [23] to identify 
significant co-regulation between the transcriptional and 
post- transcriptional layers. They found that some bio- 
logical processes emerged only in co-regulation and that 
the disruption of co-regulation may be closely related to 
cancers, suggesting the importance of the co-regulation 
of miRNAs and TFs. In [23] available predicted targets 
databases are used to construct the network, and then 
Gene Ontology (GO) was used to discover the significant 
functional co-regulation pairs. Tran et al. [24] proposed a 
rule based method to discover the gene regulatory mod- 
ules that consist of miRNAs, TFs, and their target genes 
based on the available predicted target binding informa- 
tion. Le Bechec et al. [25] integrated available target pre- 
diction databases to construct a regulatory network that 
involves miRNAs, TFs, and mRNAs. This work provides 
a good resource for exploring the regulatory relationships 
or identifying the network motifs. However, target predic- 
tion based on sequences have high rate of false discoveries, 
which affect the quality of the discoveries of the above 
mentioned methods. It would be ideal if expression data 
can be used to refine the discoveries. 

Roqueiro et al. [26] proposed a method to identify 
the key regulators (miRNAs or TFs) of pathways. The 
method used Bayesian inference on known pathway struc- 
tures to infer a set of regulators in the pathway network. 
The Bayesian network in this method was constructed 



manually using the known KEGG pathways by remov- 
ing the cycles in the pathways and applying some filtering 
criteria. The method drew findings based on existing 
knowledge and provided a good resource for other meth- 
ods to validate their results. However, it may not be good 
for exploratory study. 

Recently, Huang et al. [27] developed a web tool 
(mirConnX) for constructing the regulatory networks that 
include miRNA, TFs, and mRNAs. The built networks 
can be further analysed to identify network motifs. The 
method has used both predicted targets and expression 
data to build the network. The method integrated the 
association network based on expression data and the 
prior network based on sequence data. However, an edge 
in this network shows association, which may not indi- 
cate a regulation relationship. A strong association of A 
and B may be a result of a common regulator which regu- 
lates both A and B. Zacher et al. [28] proposed a Bayesian 
inference method based on expression data to explain the 
activity of miRNAs and TFs. However, this approach does 
not take into account the interactions between miRNAs 
and TFs. 

In this paper, we present a framework to construct the 
complex regulatory network with three components: TFs, 
miRNAs, and target genes. Our approach aims to dis- 
cover the regulatory relationships of miRNAs and TFs 
on their target genes respectively, as well as the interplay 
between the two different types of regulators. The method 
utilises multiple sources of data, including gene expres- 
sion data, target information of each regulator based on 
sequence data, and sample categories (conditions). To test 
the proposed method, we use the expression data from 
the NCI-60 panel of cell lines [29], and investigate the 
interactions that may involve in the biological process of 
epithelial- to-mesenchymal transition (EMT). 

Methods 

Notation and definitions 

Consider three expression data sets profiling K miRNAs, 
/ TFs, and / mRNAs across S samples, respectively. Let 
X = {xi(},y = {3//}, z = {zj} be the vectors of miRNAs, TFs, 
and mRNAs, respectively, where 1 < /c < /C, 1 < / < /, 
and 1 < 7 < /. Each sample is labelled by its category, i.e. 
the biological condition of the samples, such as cancer or 
normal. 

In this paper, our goal is to discover the interactions 
between y, z {x and y are regulators) supported by the 
expression data and under the constraint of target infor- 
mation (see Figure 1). Target information for a regulator is 
the interactions between the regulator and the regulated 
genes that are predicted based on the sequence data. We 
are particularly interested in the interactions between x 
and y (called the interplay of miRNAs and TFs), and net- 
work motifsy which are patterns of subgraphs that recur at 
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Figure 1 iVlethod overview. The method utilises Bayesian networl< learning and graph search to produce a three-component regulatory network 
(miRNA-TF-mRNA) from multiple sources of data. Target information is used to create the initial structure representing the interactions between 
miRNA-mRNA, miRNA-TF, TF-miRNA, TF-TF, and TF-mRNA. For illustration, we only draw one bipartite graph in the initial structure. Expression 
profiles are then used in the Bayesian network learning procedure to construct the networks in each sample condition. Bootstrapping and 
averaging procedure is used to integrate the Bayesian networks learnt from each condition into the integrated global network. The interplays 
between miRNAs and TFs and the network motifs that involve at least 2 regulators are extracted from the global network and are final results. 



frequencies much higher than those found in randomised 
networks [2]. 

Method overview 

In the remaining parts of the Methods section, we 
present our framework for constructing the regulatory 
network with the co-existence of both regulators, TFs 
and miRNAs. The method aims to produce regulatory 
networks including miRNAs, TFs, and genes that are 
relevant to diseases. The overall process is shown in 
Figure 1. 

There are three main steps in the framework: (1) 
data preparation, (2) network learning and integration, 
and (3) network inferences. In Step (1), we prepare 
the input for the network structure learning, includ- 
ing collecting target information for TFs and miRNAs, 
normalising expression data, and analysing differentially 



expressed genes. At the beginning of Step (2), the target 
information is transformed into the 5 types of network 
sub-structures (miRNA-mRNA, miRNA-TF, TF-miRNA, 
TF-TF, and TF-mRNA), which are used as the ini- 
tial structure for the Bayesian network learning process 
(refer to Figure 1). Additionally the expression datasets 
are split according to sample conditions. The initial 
structure are evaluated based on the expression pro- 
files in each condition. The Bayesian networks learnt 
under each condition are integrated using a bootstrap- 
ping and averaging procedure. Therefore the result of Step 
(2) is an integrated global network with three compo- 
nents: miRNAs, TFs, and mRNAs. In the network infer- 
ence step (Step (3)), we search the global network for 
the subgraphs that show the interplay between miRNAs 
and TFs, and network motifs that involve at least two 
regulators. 
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In the following, we describe each of the three steps in 
detail. 

Step (1): Data preparation 

Epithelial-to-mesenchymal transition (EMT) is part of the 
process of tissue remodeling during embryonic develop- 
ment and wound healing [30], and during carcinogenesis 
[31] when cancer cells undergo a change transforming into 
a more invasive tumor [30,32]. 

After EMT induction, cells lose their epithelial features 
characterised by the high E-cadherin expression level, and 
acquire mesenchymal characteristics, including Vimentin 
filaments and a flattened phenotype. By expressing pro- 
teases, cells become more invasive, and they can pass 
through the underlying basement membrane and migrate. 
These are crucial steps in the multistep process of metas- 
tasis [33]. 

Data used in this study contain miRNA expression pro- 
files for the NCI-60 panel of 60 cancer cell lines obtained 
from So et al. [34] (available at [http://www.ncbi.nlm. 
nih.gov/geo/query/acc.cgi?acc=GSE26375]). The mRNA 
expression profiles for NCI-60 are downloaded from 
ArrayExpress [http://www.ebi.ac.uk/arrayexpress], acces- 
sion number E-GEOD-5720. Cell lines categorised as 
epithelial (11 samples) and mesenchymal (36 samples) are 
used in this work. 

In order to identify all the TF genes in the data sets, we 
use the list of TF repertoire mined from [1]. This list is 
then used to query against the mRNA expression profiles 
from NCI-60 to extract TF expression profiles. 

After normalising the expression data of miRNAs, TFs, 
and mRNAs, diff^erentially expressed gene analysis is con- 
ducted respectively to all the three components, TFs, 
miRNAs, and mRNAs. The differentially expressed genes 
between epithelial and mesenchymal samples are identi- 
fied using the limma package of Bioconductor [35] with 
the Benjamini and Hochbergs (BH) correction method 
[36]. 148 probes of TFs, and 2251 probes of mRNAs are 
identified as differentially expressed at significant levels 
(adjusted /7-value< 0.1). Also 43 probes of miRNAs are 
identified with adjusted /7-value< 0.01. The reason for 
choosing adjusted 0.01 as the cut-off for miRNA differen- 
tially expressed analysis is that the B statistic value out- 
put from limma changes the value significantly between 
adjusted /7-value< 0.01 and adjusted j?-value> 0.01. This 
is a good sign for using the value of 0.01 as a cut-off, 
and the number of miRNAs obtained with this cut-off is 
also reasonable for analysing the results. (The details of 
differentially expressed genes are in Additional file 1). 

The data input to the Bayesian network learning in 
the next step is the expression profiles of three com- 
ponents, miRNAs, TFs, and mRNAs. To integrate the 
data profiled from different platforms, we discretise the 
expression values of each gene in each sample to binary 



values (standing for up-regulation and down-regulation) 
by using the median of each array as the cut-off. 

Another input to the Bayesian network learning is the 
target information, which is used as the prior knowl- 
edge (initial Bayesian network structure) to reduce the 
search space in the learning. miRNA targets and TF tar- 
gets are collected via commonly used databases. These 
databases usually predict target genes using sequence 
data. In this paper, we are particularly interested in the 
information of TFs targeting both mRNA and miRNA 
genes, and the miRNAs targeting mRNA and TF genes. 
We use TRANSFAC 9.3 [37] and the promoter database 
[38] to generate TF target information. TF target infor- 
mation for TF-mRNA and TF-TF pairs is obtained from 
CRSD [39], the database utilising and integrating six well- 
known large scale databases, including TRANSFAC 9.3 
and promoter databases. To obtain the TF-miRNA tar- 
get information, we use MIR@NT@N downloaded from 
[25]. Meanwhile miRBase V5.0 [40] from the Bioconduc- 
tor package RmiR.Hs.miRNA 2.11 is used to build the 
putative target for miRNAs. The detailed results of all 
target information are shown in the Additional file 2. 

Step (2): Bayesian networks structure learning and 
integration 

Bayesian network structure learning has been widely used 
for discovering gene-gene interaction networks [41], but 
not often for the discoveries of the interactions between 
regulators and their target genes. To represent the interac- 
tions between regulators (miRNAs and TFs) and between 
the regulators and their target genes in a Bayesian net- 
work, regulators and target genes are denoted by nodes 
and regulatory interactions are denoted by directed edges. 
When the expression data of regulators and target genes 
are given, we can use Bayesian network structure learning 
to discover the interactions. To start the learning process, 
we use the target information of regulators to initialise the 
search space. Therefore in this step, we take the expression 
profiles and target information as the input to produce a 
network that indicates the interactions between miRNA- 
TF, miRNA-mRNA, TF-miRNA, TF-TF, and TF-mRNA. 
The following four sub-steps are carried to obtain the 
network. 

Step (2. 1): Sample splitting 

To explore all possible interactions including up-, down-, 
and mix- regulations (up-regulation in one condition and 
down-regulation in the other) in different biological con- 
ditions, in [42] we developed the "splitting and averaging" 
strategy for Bayesian networks structure learning. This 
strategy splits samples in a data set by their categories of 
biological conditions. Bayesian network structure learn- 
ing is used to learn the networks from the samples of 
each condition respectively, and these networks are then 
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integrated or averaged into a single network. In our cur- 
rent problem, we firstly use this strategy to learn the 
networks for the epithelial and mesenchymal conditions 
separately, then obtain the integrated network from the 
networks learnt under the two conditions. So in this sub- 
step, we split each of the three expression datasets accord- 
ing to sample conditions, 11 samples in epithelial and 36 
samples in mesenchymal (conditions 1 and 2 in Figure 1 
respectively). 

Step (2.2): Creating the initial structure 

To learn a Bayesian network with n variables or nodes, 
the search space, if not restricted, will be all the possible 
networks formed with the variables. It has been shown 
that finding the best network from all the networks is NP- 
hard [43]. Therefore in this paper, we assume that the 
regulatory relationships between regulators and their tar- 
get genes form a bipartite graph. This assumption reduces 
the search space significantly. Moreover, we use target 
information to initialise the network structure and the 
topology of the network structure is further constrained. 
Therefore, we are able to discover the optimal solutions 
using the exhaustive search method in the given search 
space. Graphically, the target information can be repre- 
sented using bipartites as illustrated in Figure 1. There are 
5 types of such bipartites or sub-structures, corresponding 
to our initial knowledge of the interactions of miRNA- 
TF, miRNA-mRNA, TF-TF, TF-miRNA and TF-mRNA. 
These bipartites are used as the initial structure for the 
Bayesian network learning. 

Step (2.3): Bayesian networl^ learning process 

Each interaction in the initial structure is evaluated based 
on the expression data, and the high-confidence interac- 
tions are retained. The learning process searches through 



all possible candidate structures and evaluates the inter- 
actions with a Bayesian scoring function. The candidate 
structures are generated by removing edges from the ini- 
tial structure one by one. The scoring function measures 
the degree of fitness of any candidate structure G to the 
dataset. The goal is to select the structure that best fits the 
data. In other words, we need to calculate the probability 
of each candidate structure G given the data A P(G\D), 
According to Bayes' theorem, we have: 



P(G\D) 



P(G)P(D\G) 
TO 



In the above formula, the term P(D) is the same for 
all candidate structures. Regarding the term P(G), it is 
quite common to assume a uniform distribution [44], and 
thus it is a constant. Therefore, for comparative purposes, 
it is sufficient to calculate only P{D\G) for the scoring 
function. In this paper, we use the BDe (Bayesian metric 
with Dirichlet priors and equivalent) scoring function 
developed by Heckerman et al. [45] in the following. 



r(ivf)) 



ScoresiD. G) = P(D|G) = ff^^^ fl^^ ,^7^ 



where: 



n is the number of variables (nodes), Xi,X2, . . . ,Xn, 
qi is the number of different instantiations of the 
parent of a variable Xi in G, 
ri is the number of possible values of Xi in G, 



a b c 



Score = -4.6 


® © 

Score'' = -6.2 


Selected 
structure 

© — o 

AverageScore = -5.4 


Score°RiGi = -5.6 


® 

Score^RiGi = -6.8 



Figure 2 An example of Bayesian networl< structure learning, (a) The initial structure corresponding to regulator R] and target gene G] . 
This initial structure is created based on target information of the regulator, (b) All possible candidate structures generated from the initial structure. 
The interaction between R] and G] in each candidate structure is evaluated using a scoring function, (c) The presence or absence of an edge 
between R] and G] based on the average score of each case. The candidate structure with higher average score will be chosen. 
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a^jj^^ are the hyperparameters for the Dirichlet prior 
distributions of the parameters given the networl< 
structure. 

s.jj^ are the corresponding observations from data, 



(G) 



I) 



More details of the Bayesian scoring function can be 
found in [45,46]. In practice, we use the Bayes Net tool- 
box for Matlab (BNT) [47], and the BDe scoring function 
is included in BNT. In the next step (Step (2.4)) we eval- 
uate the confidence levels of the interactions output from 
the Bayesian network structure learning. 

For illustration purpose, consider the learning proce- 
dure for the interaction between a regulator Ri and its 
target gene Gi. Assume that in total Ri has two targets 
and the corresponding initial structure is in Figure 2a. The 
interactions in each of the four possible candidate struc- 
tures (see Figure 2b) can be evaluated with the scoring 
function based on expression data. The scores will decide 
if there is no edge between i^i and Gi or an edge between 
them. In this example, there are two structures with an 
edge between Ri and Gi, and two structures with no edge 



between them. The average score in each of the two cases 
is calculated and the structure with higher average score 
((-4.6-6.2)/2=-5.4) is chosen (Figure 2c). 

Step (2.4): Integrating and bootstrapping 

It is common to have small number of samples in the 
dataset of a typical microarray experiment, which unfor- 
tunately cannot support statistically significant discov- 
eries. To overcome this problem, bootstrapping [48] is 
usually used to improve the confidence of discoveries. 
Especially, in Bayesian network structure learning, boot- 
strapping can be combined with model (structure) average 
procedure to discover the interactions with high confi- 
dence. In this paper, the averaging procedure is firstly 
applied to the Bayesian network learning process across 
different candidate structures. This procedure is then 
applied to the sample splitting step across different sam- 
ple conditions to calculate the average score for each 
interaction. Next, the score of each interaction is aver- 
aged over the number of bootstrapping, and the confi- 
dence levels are estimated based on a statistical model 
as illustrated in the next paragraph. The interactions 
with high confidence (/?-value< 0.05) are selected to 




Figure 3 The interplays between TPs and miRNAs - miRNAs regulate TPs. miRNAs are coloured in red and TFs are in green. Tine confirmed 
edges are liigliliglited witli bold lines. All of the nodes in the confirmed interactions are EMT bio-markers. They are the miR-200 family, ZEBl , ZEB2, 
and SNAI2 (SLUG). The miR-200 family that regulates ZEBl , ZEB2 for EMT has been confirmed by literature. 
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form the integrated network (called global network in the 
paper) 

Consider again the example about learning the inter- 
action between R\ and G\, Let n be the number of 
bootstrapping iterations, qc be the event of learning the 
interaction on the local data set Dc of the c^^ condition 
(c G {1, ...,C}). As there are only two possible cases 
of interactions between Ri and Gi, we approximate the 
whole learning process of the interaction between Ri and 
Gi as a Bernoulli experiment. We denote = 1 when 
there is an edge between R\ and G\ (otherwise qc = 0), 
and assume that p(qc = 1) = p(qc = 0) = 0.5. qc fol- 
lows a binomial distribution qc ~ B{n,p), as the samples 
drawn with replacement in the bootstrap are independent 
[49]. At the integration stage by averaging, the interac- 
tions from local data sets Dc are aggregated, and the 
interactions of the regulator i^i and its target Gi learnt 
through multiple data sets for the C different conditions 
(denoted as Qr^Gi = Xlc^c) also follows a binomial 
distribution Qr^Gi ~ B{Cn,p), Adopting this statisti- 
cal model, we are able to extract the learnt interactions 
at significant levels. The interaction that has the confi- 
dence level greater than the threshold will be included 
into the integrated global network. The Matlab codes 
for the whole process is available on request, and the 



results for the integrated global network is in Additional 
file 3. 

Step (3): Network inference 

A challenging problem of Bayesian network learning is the 
complexity of the resulting networks. Bayesian network 
learning usually produces a large number of interactions, 
including false discoveries. In this step, we extract from 
the global network learnt in the previous step the interplay 
between miRNAs and TFs. We search the learnt global 
network for all of the interactions between miRNAs and 
TFs. The resulting interplay network will help elucidate 
the complex regulatory mechanism in specific biological 
conditions. 

In addition to discovering the interplay between miR- 
NAs and TFs, we use the network motif finding algorithm, 
CyclusSD [50], to discover the network motifs that involve 
at least 2 regulators. Network motifs are patterns of sub- 
graphs that recur at frequencies much higher than those 
found in randomised networks [2]. The randomised net- 
works satisfy the following criteria: 1) they have the same 
number of nodes as in the real network, 2) each node in 
a randomised network has the same number of incom- 
ing and outgoing edges as the corresponding node has 
in the real network, 3) the randomised networks used to 




Figure 4 The interplays between TFs and miRNAs - TFs regulate miRNAs. The confirmed edges are highlighted with bold lines. All of the nodes 
in the confirmed interactions are EMT bio-markers. SNAI2 (SLUG) regulates miR-200 family to indirectly control ZEBl and ZEB2 for activating EMT 
regulation procedure. These interactions have been confirmed by literature. 
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calculate the significance of n-node subgraphs were gen- 
erated to preserve the same number of appearances of 
all (n — l)-node subgraphs as in the real network. These 
criteria ensure the randomised networks have the similar 
structure to the real network, and ensure that a high- 
significance pattern is assigned not because it has a highly 
significant sub-pattern [2]. 

The resulting motifs can be considered as simple build- 
ing blocks from which the network is composed [51], and 
are believed to have specific functions which play criti- 
cal roles in biological network inference [52] . The results 
presented in the next section show that this network 
inferences step can produce a set of interactions and 
molecules which are highly relevant to the biological con- 
dition of the EMT datasets. 

Results 

The output of the method are two types of networks: 
1) the interplays between miRNAs and TPs, with their 
details shown in Figure 3 and Figure 4; 2) the results 



of network motif finding, which are the Feed For- 
ward Loops (FFLs) that involve at least two regulators 
(see Figure 5). 

From Figures 3, 4, and 5, we can see that the results 
generated by our method are compact with only a small 
number of interactions. These interactions have been 
shown to be highly relevant to the biological conditions 
of EMT, and also several EMT bio-markers which have 
been confirmed by literature are identified by our method. 
In the rest of this section, firstly we present the inter- 
actions and bio-markers that have been confirmed from 
literature, then we describe the enrichment analysis we 
have conducted to show the relevance of identified genes 
to EMT. 

Confirmed interactions and bio-markers for EMT 

Previous studies [33,53,54] have demonstrated that the 
miR-200 family targets the E-cadherin transcriptional 
repressors zinc finger E-box binding homeobox l(ZEBl) 
and ZEB2 for EMT. These results have confirmed the 




Figure 5 Network motif. Feed Forward Loop network motifs are extracted from the integrated global network. The genes identified in these 
network motifs are significantly enriched for biological functions of EMT. 



Le etal. BMC Bioinformatics 201 3, 14:92 
http://www.bionnedcentral.conn/l 471 -21 05/1 4/92 



Page 9 of 13 



interactions found using our method (shown in Figure 3), 
where we see that the hsa-miR-200 family (miR-200a, 
miR-200b, miR-200c, miR-429) regulates both ZEBl and 
ZEB2. These interactions are the important interac- 
tions that are involved in the process of inhibition and 
induction of EMT. Figure 6 shows the process in detail 
where genes identified by our method are marked with 
red bars. 



Apart from the miR-200 family, several important tran- 
scription factors that act as the bio-markers for EMT 
are also confirmed by our method. The two transcrip- 
tion factors, ZEBl and ZEB2, which are regulated by the 
miR-200 family, are the markers in all three subtypes of 
EMT [55]. Another transcription factor that plays a cru- 
cial role in EMT is SNAI2 (SLUG). In fact, all known EMT 
events during development, cancer, and fibrosis appear 





Figure 6 The pathway of development_microRNA-dependent inhibition of EMT. Genes identified by our metliod are marl<ed witli red bars. 
miR-200 family regulates ZEBl and ZEB2 in the process of inhibition and induction of EMT. These interactions are also identified by our method. 
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to be associated with SNAI activation [56]. Our results 
suggest that SNAI2 indirectly regulates ZEBl and ZEB2 
by regulating the miR-200 family transcript (Figure 4), and 
in turn the miR-200 family regulates ZEBl and ZEB2 (see 
Figure 3). This result is consistent with the literature as 
SNAI2 is confirmed to regulate the miR-200 family [57]. 
The other EMT bio-marker identified by our method is 
ETSl (see Figure 4). It has been suggested that ETSl is an 
upstream regulator of ZEBl and ZEB2 [58], and therefore 
plays a critical role in activating the regulation of EMT. 

Functions of identified genes being highly enriched for 
EMT 

The functions of the identified genes (in Figures 3, 4, and 
5) and the pathways which the genes potentially consti- 
tute are analysed using GeneGo Metacore from GeneGo 
Inc. and the Ingenuity Pathway Analysis (IPA, Ingenu- 
ity Systems, www.ingenuity.com). The genes identified as 
a result of the network inference step are significantly 
enriched for several biological functions. The top func- 
tions output from IPA that are known to be critical for 
EMT are gene expression, cellular development, cellu- 
lar growth and proliferation, cellular movement, and cell 
death. Moreover, several genes belong to the classes of 
invasion and migration. These classes are sub-categories 
of cellular movement, and they have been confirmed as 
the functional markers of EMT [59]. This suggests that 
many target genes and their interactive regulators are 
involved in EMT. Table 1 shows the genes in the class of 
invasion and migration together with their p-values. 

In addition, the pathways which the genes in our 
results potentially constitute are identified using GeneGo 

Table 1 Identified genes are significantly involved in the 
functional markers of EMT 

Functions Molecules Number p-value 

Invasion VDR, FGF8, miR-1 7-5p, 16 1.26E-06- 

5.4E-03 

miR-200a-3p, miR-429, miR-7-5p 
TIMP1,CDX2, ETS1,FLNA, 
F0SL1,SPDEF,TNK2,YY1 
ZEB1,ZEB2. 

Migration ESRRA, ETSl , F0SL1 , 24 3.59E-09 - 

7.96E-03 

PRKD1,SNAI2,SPDEF,TIMP1, 
TNK2, ABL2,CDX2, FGF8, 
FLNA, GRB7, let-7a-5p, miR-1 6-5p, 
nniR-17-5p, nniR-200a-3p, nniR-429, 
NFIC, PRAP1,PTPRU, RXRA, 
SREBF1,ZEB1 

A significant number of genes identified in the inference step belong to the 
class of invasion and migration which are EMT functional markers. The results 
are generated by IPA. 



Metacore. The statistically mapped pathways show that 
they are highly relevant to EMT. There are direct path- 
ways that regulate EMT, such as the pathway of devel- 
opment_microRNA-dependent inhibition of EMT. This 
pathway shows the regulation of the miR-200 family 
and other miRNAs on the EMT bio-markers ZEBl and 
ZEB2, and results in the inhibition and induction of EMT. 
Figure 6 shows the details of this pathway. Other direct 
pathways such as the development_TGF-beta-dependent 
induction of EMT via SMADs, and cell adhesion.tight 
junctions, are known to play critical roles in the regulatory 
procedure of EMT. The summary of these pathways and 
the corresponding /^-values are given in Table 2. 

Discussion 

During the past few decades, considerable efforts have 
been made to explore the transcriptional regulatory net- 
works in which transcription factors play the role as a 
main regulator. Other recent studies have investigated the 
post-transcriptional regulatory networks with miRNAs as 
the main regulator. However, with the ultimate goal of 
achieving a profound understanding of the mechanisms 
that control gene activities, it is sensible and desirable 
to find regulatory relationships involving both types of 
regulators from diverse sources of data. 

In this paper, we utilise Bayesian network learning in 
constructing the network, but the integrated global net- 
work in general is not a Bayesian network. For instance, 
one of the requirements for Bayesian networks is that 
the network structure must be a Directed Acyclic Graph 
(DAG). Our integrated global network may include some 
loops of interactions where two regulators interact with 
each other, hence it is not a Bayesian network. Such 
cyclic network behaviour is more reasonable in real- 
ity, as more and more feedback loops between miRNAs 
and TFs are being reported. For instance, the ZEB/miR- 
200 pair is a feedback loop that regulates EMT [60]. 
Therefore, the integrated global network may provide 
more information than normal Bayesian networks which 
are DAGs. 

In the network inference step, we use network motif 
finding algorithm to discover the sub-networks that recur 
at statistically significant level. Interestingly, the results 
from these small sub-networks still retain several impor- 
tant interactions and molecules relevant to the biological 
condition of the dataset. The relationships between the 
significance in frequency of graphs and biological func- 
tions are still open and interesting research topics. In the 
dataset used in this paper, the results are supportive for 
this hypothesis. An advantage of motif finding is that it 
produces a manageable number of interactions that can 
be used for further experimentation. The results from this 
paper, therefore, can provide good resources for future 
validating experiments. 
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Table 2 The statistically mapped pathways for EMT involve identified genes 



# 


Pathway 


p-value 


1 


Development_microRNA-dependent inhibition of EMT 


8.645E-17 


2 


Development_WNT signaling patliway. Part 2 


2.227E-05 


3 


Development_TGF-beta-dependent induction of EMT via SMADs 


7.325E-05 


4 


Cell adhesion.Chemokines and adhesion 


4.632E-04 


5 


Cell adhesion_Tight junctions 


1.614E-03 


6 


Cell adhesion_Role of tetraspanins in the integrin-mediated cell adhesion 


1 .748E-03 


7 


Development_TGF-beta receptor signaling 


4.154E-03 



The mapped pathways involve identified genes that are Important for EMT. The results are generated by GeneGo Metacore. 



While the network motifs found based on the regulatory 
network may provide useful patterns to guide biologi- 
cal experiments, these motifs depend on the structure of 
the regulatory network. The structure of the regulatory 
network in this paper is obtained based on the assump- 
tion that miRNAs and TFs are regulators and mRNAs 
are targets. However, the knowledge of gene regulatory 
relationships is still limited and the assumption may not 
always hold in reality. When the structure of the regula- 
tory network changes the resulting network motifs may 
change too. 

In the paper, we use the differentially expressed genes 
as the nodes for the gene regulatory network. We assume 
that genes whose expression levels do not change signif- 
icantly between conditions would not play an important 
role in the regulatory network. There may be the case 
that a gene is the target of two regulators that cancel 
out each other, resulting in the expression level of the 
target gene unchanged. However, to make our method 
computationally practical we do not consider such cases. 

To start the process of Bayesian network structure 
learning, target information is used to initialise the net- 
work. The target information based on sequence data 
may involve false discoveries. Bayesian network struc- 
ture learning uses gene expression data to evaluate the 
confidence level of each interaction (edge) in the initial 
network, and only the interactions of high confidence 
are integrated into the global network. Therefore, graph- 
ically the set of edges in the global network is a subset 
of the set of edges in the initial network. The enrich- 
ment analysis shows that the important interactions for 
EMT are retained in the global network, demonstrating 
the effectiveness of the method. Other high-confidence 
interactions provide strong hypotheses for experimental 
validations. 

Conclusions 

In this study, we have proposed a framework for inferring 
complex gene regulatory networks using diverse sources 
of data, including target information for regulators, 
expression profiles, and sample categories. The interplay 



between regulators and the motifs with which they reg- 
ulate target genes are revealed in the three-component 
network, and it is impossible to infer the interplay from 
any single regulator regulatory networks. The analysis of 
the EMT datasets has produced several results that have 
been validated by literature, a number of statistically sig- 
nificant interactions between miRNAs and TFs, and novel 
network motifs. 
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